Comparison data on SAT-verbal and mathematical were
collected on pairs of examinees in three samples for later use in 
detecting instances of willful copying. Two of the samples were 
constructed with the knowledge that no examinee could possibly have 
copied from the answer sheet of any other examinee in the sample. The 
third sample was taken entirely from a single center believed to be
free of cheating. In each sample the answer sheet of each examinee
was compared with the answer sheet of every other examinee. Eight 
detection indices were developed and distributions were run for 
possible operational use in making future judgments regarding 
examinees who were actually suspected of copying. Covariance analyses 
between samples indicated statistical but not practical significance, 
and consequently it was judged that any one of the samples could 
serve the purposes of operational detection as well as either of the 
other two. Empirical tryout of the indices against known and admitted 
copiers gave some results which permitted the elimination of three of 
the indices from further use. Practical considerations removed a 
fourth, and further statistical study eliminated two others. The 
remaining two have been in successful operational use at Educational 
Testing Service for more than two years. (Author) 







U.S. DEPARTMENT OF HEALTH,
EDUCATION & WELFARE
OFFICE OF EDUCATION

THIS DOCUMENT HAS BEEN REPRODUCED EXACTLY AS RECEIVED FROM
THE PERSON OR ORGANIZATION ORIGINATING IT. POINTS OF VIEW OR
OPINIONS STATED DO NOT NECESSARILY REPRESENT OFFICIAL OFFICE
OF EDUCATION POSITION OR POLICY.




COLLEGE ENTRANCE EXAMINATION BOARD 
RESEARCH AND DEVELOPMENT REPORTS 

RDR-72-73, NO.1 


RESEARCH BULLETIN 
RB-72-26 JULY 1972 



The Development of 
Statistical Indices 
for Detecting Cheaters 

William H. Angoff



EDUCATIONAL TESTING SERVICE 
PRINCETON, NEW JERSEY 
BERKELEY, CALIFORNIA 



THE DEVELOPMENT OF STATISTICAL INDICES FOR DETECTING CHEATERS 1 

The problems of cheating during test administrations may be dealt with in
one or both of two ways: by discouraging and deterring cheating before it takes
place and by detecting it and taking corrective action after it takes place.
Most of the deterrent procedures are fairly obvious: They include identity
checks, close supervision during the test, the use of two or more forms of the
test distributed randomly throughout the testing room, planned seating
arrangements to make cheating difficult, threats of punishment for detected
cheating, etc. Methods of detecting individual cases of cheating fall into two
general categories, depending on whether impersonation or copying was the
method of cheating employed.

One solution to the impersonation problem is fairly straightforward, in
theory, though often difficult to implement: One compares the handwriting
shown on the suspect answer sheet with authentic specimens of handwriting.
Here, of course, judgment plays an important role, and it is sometimes necessary
to enlist the help of a handwriting expert.

Methods of detecting copying are also difficult, particularly after the
test session is over and the answer sheets have been turned in. The obvious
method is to compare the responses on the suspect answer sheet with responses
on the answer sheets of examinees seated nearby and to look for
greater-than-normal similarities. But the question of establishing the range of
"normal" similarities itself presents a problem. One solution that suggests itself

1 This research was supported by the College Entrance Examination Board.
The author wishes to express his appreciation to the ETS Board of Review for
their helpful comments and suggestions in reviewing this manuscript: J. E.
Allaway, J. T. Campbell, F. R. Kling, J. S. Kramer, L. R. Lavine, W. B. Schrader,
R. E. Smith, E. E. Stewart, and P. W. Williams.






involves the construction of a theoretical distribution of identical responses
that would be expected in random pairs of answer sheets for examinees who are
known to be honest. However, even a brief consideration of this solution makes
it clear that the complexities in making theoretical estimates of such a
distribution are far too great to make it practical. For example, the easy
assumption that the options of an item are equally attractive, and therefore
equally probable, is obviously false and unjustified. Therefore, in the
construction of any distribution of similar responses made by random pairs of
honest examinees one would have to take into consideration differences in the
popularity of the responses. Secondly, although it would certainly make the task
of developing those distributions a much easier one if one could assume that the
items were uncorrelated, we know that such an assumption is an unreasonable one;
the correct responses to a test are not uncorrelated. And even if one had
reasonably good estimates of those correlations, the task of using them in
generating the distributions appears to be formidable. When it is further
recalled that intercorrelations among the patterns of incorrect responses would
also have to be considered, it becomes even clearer that the task of developing
these distributions theoretically approaches quite unreasonable proportions.

The present paper describes an effort to develop distributions of similar
responses made by pairs of "honest" examinees to use in future work in
detecting efforts to copy during test administrations. Because of the foregoing
considerations, however, it was concluded that the only practical way to
develop these distributions was to do so empirically. The remainder of this
paper will describe the procedure of developing these distributions and the
analyses that followed.
Samples

Three samples of examinees were drawn from actual test administrations, 
and identical indices were developed for all three samples for comparing the 
answer sheet of each examinee with the answer sheet of each of the other 
examinees in the sample. The first of these samples described below is the 
principal sample in the study and is the basis for the norms used in later 
actual detection work. The other two samples were used only for verifying 
the usefulness of the first sample. 

1. Sample 1 was constructed by selecting every thousandth examinee
taken from every odd-numbered computer tape from the December 1968
administration of the College Board SAT. In the selection of these cases care
was taken that each examinee came from a different testing center. If an
examinee was chosen who did in fact come from a center represented by a
previously selected examinee, he was replaced with the next examinee who came
from a unique center. This process of selection yielded a sample of 203
examinees. By comparing the item responses of each of the 203 with the other
202, it was possible to collect data on 20,503 pairs of answer sheets. Since
these 203 examinees were sitting for the examination in different geographical
locations, it was impossible for them to copy from one another's paper. In that
sense, then, and for the purpose of these data, they were "honest" examinees and
their responses were therefore usable for developing "norms" for honest
examinees.

2. Sample 2 was collected in order to check on the hypothesis that
the answer sheets for examinees tested in the same geographical location
might show greater similarities than the answer sheets of examinees sitting
in separate locations, even though they were innocent of improper behavior.
This hypothesis might be supported by the possibility that, for example,
examinees in the same center might have studied and learned the same
misinformation from the same source. To determine, then, whether such
similarities occur more often than similarities in answer sheets coming from
different testing rooms, a center with an unblemished security history was
chosen and data were developed by comparing the responses on each answer sheet
in that center with the responses on each of the other answer sheets in that
center. Since there were 122 examinees in that center, it was possible to make
7381 paired comparisons.

3. Sample 3 was also chosen as a check on the first. The purpose
of the check was to determine whether data based on a different examinee group,
responding to a different form of the SAT, might yield a different set of
results. Clearly, if the "norms" to be developed could not be generalized
but were unique to the form of the test and unique to the nature of the examinee
group, then their usefulness in the course of future operational work in the
detection of cheaters would be substantially diminished. Accordingly a set of
data was developed by drawing a sample similar to Sample 1, one examinee from
each of 209 centers, but taken from the March 1969 administration when a
different form of the SAT was given. With 209 examinees in Sample 3, a total of
21,736 paired comparisons were made.
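
The numbers of paired comparisons quoted for the three samples follow from comparing every examinee with every other examinee once, that is, n(n - 1)/2 pairs for n examinees. The short sketch below simply recomputes them; the function name `n_pairs` is introduced here for illustration only.

```python
def n_pairs(n_examinees: int) -> int:
    """Number of distinct unordered pairs among n examinees: n(n - 1)/2."""
    return n_examinees * (n_examinees - 1) // 2


# Sample sizes as reported in the text.
for label, n in [("Sample 1", 203), ("Sample 2", 122), ("Sample 3", 209)]:
    print(label, n, n_pairs(n))
```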

Variables 

The observations for each of the variables listed below were derived from 
the examination of the responses of pairs of examinees, where i = one examinee 
in a pair and j = the other examinee in that pair. Parallel sets of variables 
were derived for SAT-verbal and SAT-mathematical.

$R_i R_j$ = the number of items answered correctly by examinee i times the
number of items answered correctly by examinee j.

$R_{ij}$ = the number of items answered correctly by both i and j.

$W_i W_j$ = the number of items answered incorrectly by i times the number
of items answered incorrectly by j.

$W_{ij}$ = the number of items answered incorrectly by both i and j.

$Q_{ij}$ = the number of items answered incorrectly in the same way (i.e.,
by making the same incorrect response) by both i and j.

$O_i O_j$ = the number of items omitted by i times the number of items
omitted by j. (Note that an "omit" is defined as a nonresponse to an item
that appears prior to the last item attempted in the test; hence omits do not
include items "not reached.")

$O_{ij}$ = the number of items omitted by both i and j.

$W_i$ (or $W_j$), whichever is smaller.

$O_i$ (or $O_j$) for the examinee whose $W_i$ (or $W_j$) was the smaller.

$S_i = W_i + O_i$.

$S_{ij} = Q_{ij} + O_{ij}$.

$K_{ij}$ = the longest "run" of identically marked incorrect responses and
omits. Before defining the "run," it will be useful to define a "succession"
of items. This is a consecutive block of items in which all items are marked
(or unmarked) in precisely the same way: correct, incorrect, or omit. The
"run" is the number of items answered incorrectly in the same way by both i
and j (i.e., the $Q_{ij}$) within the succession plus the number of items
omitted by both i and j (i.e., the $O_{ij}$) within the succession. (Note
that although the succession is the length of a consecutive block of items,
the run within that succession may not be consecutive. Note also that in any
ij comparison there may be more than one run.) $K_{ij}$ is defined as the
longest run in an ij comparison.
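
The definitions above can be sketched in code. The following is a minimal illustration, not the procedure actually used in the study: it assumes each answer sheet is a list of chosen options with `None` for a blank, treats blanks after an examinee's last marked item as "not reached," reads a "succession" as a maximal block of consecutive items on which the two sheets are marked identically, and breaks ties in "whichever is smaller" in favor of examinee i. The function and key names are hypothetical.

```python
from typing import Dict, List, Optional, Sequence

Response = Optional[str]  # None = item left blank


def _statuses(key: Sequence[str], resp: Sequence[Response]) -> List[str]:
    """Classify each item as 'right', 'wrong', 'omit', or 'not_reached'."""
    marked = [n for n, r in enumerate(resp) if r is not None]
    last = marked[-1] if marked else -1  # index of the last item attempted
    out = []
    for n, r in enumerate(resp):
        if r is None:
            # A blank before the last attempted item is an omit.
            out.append("omit" if n < last else "not_reached")
        else:
            out.append("right" if r == key[n] else "wrong")
    return out


def pair_variables(key: Sequence[str],
                   resp_i: Sequence[Response],
                   resp_j: Sequence[Response]) -> Dict[str, int]:
    """Compute the twelve pairwise variables for examinees i and j."""
    st_i, st_j = _statuses(key, resp_i), _statuses(key, resp_j)
    R_i, R_j = st_i.count("right"), st_j.count("right")
    W_i, W_j = st_i.count("wrong"), st_j.count("wrong")
    O_i, O_j = st_i.count("omit"), st_j.count("omit")

    both_right = [a == "right" and b == "right" for a, b in zip(st_i, st_j)]
    both_wrong = [a == "wrong" and b == "wrong" for a, b in zip(st_i, st_j)]
    same_wrong = [bw and x == y
                  for bw, x, y in zip(both_wrong, resp_i, resp_j)]
    both_omit = [a == "omit" and b == "omit" for a, b in zip(st_i, st_j)]
    Q_ij, O_ij = sum(same_wrong), sum(both_omit)

    # K_ij: longest count of identical wrongs plus joint omits inside a
    # block of consecutive identically marked items (assumed reading).
    longest = current = 0
    for n in range(len(key)):
        if both_right[n] or same_wrong[n] or both_omit[n]:
            if same_wrong[n] or both_omit[n]:
                current += 1
            longest = max(longest, current)
        else:
            current = 0  # the succession is broken

    # Smaller W and the O of that same examinee (ties go to i: assumption).
    W_min, O_min = (W_i, O_i) if W_i <= W_j else (W_j, O_j)

    return {
        "RiRj": R_i * R_j, "Rij": sum(both_right),
        "WiWj": W_i * W_j, "Wij": sum(both_wrong), "Qij": Q_ij,
        "OiOj": O_i * O_j, "Oij": O_ij,
        "W_min": W_min, "O_min": O_min,
        "S_i": W_min + O_min, "S_ij": Q_ij + O_ij, "K_ij": longest,
    }


# Made-up eight-item example.
key = list("ABCDABCD")
sheet_i = ["A", "C", "D", "D", None, "C", "C", None]
sheet_j = ["A", "C", "D", "A", None, "C", "B", "D"]
result = pair_variables(key, sheet_i, sheet_j)
print(result)
```

In the made-up example the last blank on sheet i follows that examinee's final marked item, so it is counted as "not reached" rather than as an omit.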
Analyses

Using the foregoing 12 variables, bivariate distributions were prepared
for the eight indices shown in Table 1, eight for SAT-verbal and eight for
SAT-mathematical. The intent in developing these indices was that in the
investigation of an actual case of suspected copying the departure for that
case of the value on the dependent variable from the mean of the norms group
would be examined, but only after controlling on the independent variable. The
value of the dependent variable for that case, or its departure from the mean
of the array, is referred to here as the "index of copying."

The first phase of the analysis was conducted in order to evaluate the
degree to which the norms tables derived from these bivariate distributions could
be generalized to other data. Accordingly, two sets of covariance analyses were
conducted: (1) to determine whether the regression systems formed with the data
of Sample 1 were significantly different from those of Sample 2; and (2) to
determine whether the regression systems resulting from the data of Sample 1
were significantly different from those of Sample 3. The intent of these
analyses was to determine whether the data of Sample 1, which presumably would
form the basis for developing the norms, were idiosyncratic in the sense that
(1) they would behave differently from data collected for noncheaters who were
assembled for the test administration in the same room; and (2) they would
behave in a way that was somehow characteristic of the particular form of the
SAT used at that test administration and/or characteristic of the examinees
tested at that time.

The method of analysis of covariance followed the model developed by
Gulliksen and Wilks (1950), in which the regression systems are tested
successively for differences in errors of estimate, slopes, and intercepts. Tables
Table 1

Description of Copying Indices

Bivariate
Distribution    Independent     Dependent
(Index)         Variable (x)    Variable (y)

A               $R_i R_j$       $R_{ij}$
B               $W_i W_j$       $Q_{ij}$
C               $W_{ij}$        $Q_{ij}$
D               $O_i O_j$       $O_{ij}$
E               $W_i$           $Q_{ij}$
F               $O_i$           $O_{ij}$
G               $S_i$           $S_{ij}$
H               $S_i$           $K_{ij}$
2 and 3 summarize these results and show that for the most part the differences
are indeed significant, some of them far beyond the one per cent level.
However, in evaluating these results the sizes of the samples on which these
analyses were based must be kept in mind. For the purpose of these analyses
Sample 1 consisted of 20,503 "cases" (i.e., comparisons, which were not
entirely independent in this study); Sample 2 consisted of 7381 "cases," and
Sample 3 consisted of 21,736 "cases." (The numbers of actual examinees, it
is recalled, were 203, 122, and 209.) With "sample sizes" of these magnitudes
even very small differences would have been found to be significant. Indeed,
detailed examinations of the array means on the dependent variables for these
three samples at each interval on the independent variables revealed only
trivial differences. In some very rare instances, as in the data that gave
the most highly significant results for the tests of intercepts (e.g., in
Table 2, in the test for Index A, Mathematical; also, in Table 3, in the test
for Index G, Mathematical), the means of the arrays for the separate samples
differed by only two and one-half points at most, and even then only when the
data in the arrays were sparse and very likely unstable. In the very large
majority of instances the means for Samples 2 and 3 would have rounded to the
same whole number as for Sample 1 and would have led to precisely the same
conclusion as that based on the data for Sample 1 in the disposition of any
actual security case. Accordingly, it was judged that the data of Sample 1
would be sufficiently general to use in developing the "norms."

Validation 

The second phase of the analysis involved an attempt to validate the
indices and to determine, if possible, which one(s) were most useful in
identifying actual cases of copying. From the data already available it was
possible to determine the extent to which the independent variable (see Table 1)
involved
Table 2

Analyses of Covariance
Sample 1 vs. Sample 2

Bivariate                                 Values of Chi Square
Distribution                    Errors of
(Index)                         Estimate*     Slopes*       Intercepts*

Verbal
A: $R_i R_j$ vs. $R_{ij}$        41.00**       85.59**        12.00**
B: $W_i W_j$ vs. $Q_{ij}$        10.42**        1.19           1.24
C: $W_{ij}$ vs. $Q_{ij}$          0.75          9.96**         8.09**
D: $O_i O_j$ vs. $O_{ij}$       456.28**       52.85**        27.53**
E: $W_i$ vs. $Q_{ij}$             7.95**        1.04           2.10
F: $O_i$ vs. $O_{ij}$            94.64**      208.77**       570.76**
G: $S_i$ vs. $S_{ij}$             2.81         18.65**       107.38**
H: $S_i$ vs. $K_{ij}$            37.50**       20.51**        56.10**

Mathematical
A: $R_i R_j$ vs. $R_{ij}$         0.01        552.91**      1496.50**
B: $W_i W_j$ vs. $Q_{ij}$         0.07         10.88**       209.49**
C: $W_{ij}$ vs. $Q_{ij}$          7.48**        3.54         155.25**
D: $O_i O_j$ vs. $O_{ij}$      1530.57**      375.65**        52.25**
E: $W_i$ vs. $Q_{ij}$             9.32**       11.50**       156.15**
F: $O_i$ vs. $O_{ij}$          1512.9[?]**     62.99**       410.25**
G: $S_i$ vs. $S_{ij}$             9.93**       48.12**        74.08**
H: $S_i$ vs. $K_{ij}$            20.40**        4.54          65.51**

*One degree of freedom
**Significant beyond 1% level
Table 3

Analyses of Covariance
Sample 1 vs. Sample 3

Bivariate                                 Values of Chi Square
Distribution                    Errors of
(Index)                         Estimate*     Slopes*       Intercepts*

Verbal
A: $R_i R_j$ vs. $R_{ij}$         1.92          7.57**       427.87**
B: $W_i W_j$ vs. $Q_{ij}$         4.35         22.56**        54.22**
C: $W_{ij}$ vs. $Q_{ij}$          0.05          1.72          30.89**
D: $O_i O_j$ vs. $O_{ij}$       264.32**        5.14          43.61**
E: $W_i$ vs. $Q_{ij}$             0.02         28.24**        16.41**
F: $O_i$ vs. $O_{ij}$           882.40**       15.15**         5.63
G: $S_i$ vs. $S_{ij}$           185.06**        0.02           1.02
H: $S_i$ vs. $K_{ij}$           105.79**        5.55           4.28

Mathematical
A: $R_i R_j$ vs. $R_{ij}$        66.50**        0.68          36.97**
B: $W_i W_j$ vs. $Q_{ij}$       122.96**       26.83**       882.41**
C: $W_{ij}$ vs. $Q_{ij}$         50.05**        2.68         685.76**
D: $O_i O_j$ vs. $O_{ij}$      2192.11**      100.84**       842.49**
E: $W_i$ vs. $Q_{ij}$            27.89**        5.52         693.64**
F: $O_i$ vs. $O_{ij}$           105.08**      168.28**       806.22**
G: $S_i$ vs. $S_{ij}$             2.16         45.15**      1071.05**
H: $S_i$ vs. $K_{ij}$            64.57**       10.87**       397.50**

*One degree of freedom
**Significant beyond 1% level
in each of the indices was useful as a control when referring to the
information provided by the dependent variable. Table 4 provides this information
in the form of correlation coefficients between the independent and dependent
variables involved in each index. Correlations are given for each of Samples
1, 2, and 3, for the verbal and mathematical sections of the test.

The validities of the separate indices cannot be anticipated from these
correlations, however, but need to be determined empirically against an
independent criterion of known copying. To this end answer sheets for a
group of 50 cases of known and admitted copiers from recent administrations
were assembled, together with answer sheets for the individuals from whom they
copied. For each case 16 t-values, 8 verbal and 8 math, were calculated, based
on Sample 1 data, each describing the deviation of the "index of copying" of
that case from the mean of the appropriate array. For example, in considering
the verbal index, $W_i W_j$ vs. $Q_{ij}$, the product $W_a W_b$ was calculated,
representing the number of items person a was observed to answer incorrectly
($W_a$) times the number of items answered incorrectly by the person from whose
answer sheet he admitted copying ($W_b$). Then the value $Q_{ab}$, the number of
items that person a and person b were observed to answer incorrectly in the same
way (for example, by marking response position d when c was correct), was
recorded. Referring to the bivariate distribution of $W_i W_j$ vs. $Q_{ij}$ for
Sample 1, the particular array of $Q_{ij}$ was examined for this interval of
$W_i W_j$. The value, $t = (Q_{ab} - \bar{Q}_{ij}) / s_{Q_{ij} \cdot W_i W_j}$,
was then determined. As already mentioned, 16 t-values of this sort were
calculated, 8 for SAT-verbal and 8 for SAT-mathematical, corresponding to the
scatterplots described above in Table 1. The rule was adopted in advance that
any ab comparison for which any one of the 16 t-values equalled or exceeded 3.0
represented a validation of the general procedure. The result of a tabulation of
these t-values revealed that every one of the 50 cases was identified as a
copying case by at least one of the 16 indices, with most of the t-values
ranging from 3.0 to about 23.0 (there was one additional t-value of 27.5 and
still another of 45.01).
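
The calculation of a t-value for one suspected pair can be sketched as follows. This is an illustrative reading only: it assumes the norms are held as a list of (x, y) observations from honest pairs, that an "array" is the set of norm observations whose x falls in the same fixed-width interval as the suspect pair, and that the denominator is the standard deviation of y within that array. The function name, the interval width, and the toy numbers are all hypothetical.

```python
import statistics
from typing import List, Tuple


def copying_t(norm_pairs: List[Tuple[float, float]],
              x_case: float, y_case: float, width: float) -> float:
    """Deviation of a case's dependent value from its array mean, in SD units."""
    lo = (x_case // width) * width  # lower bound of the case's x interval
    array = [y for x, y in norm_pairs if lo <= x < lo + width]
    mean = statistics.mean(array)
    sd = statistics.pstdev(array)   # SD within the array (assumed denominator)
    return (y_case - mean) / sd


# Toy norms: (W_i * W_j, Q_ij) for a handful of honest pairs.
toy_norms = [(120, 1), (150, 2), (180, 3), (250, 9)]
t = copying_t(toy_norms, x_case=130, y_case=8, width=100)
flagged = t >= 3.0  # the validation rule stated in the text
print(round(t, 2), flagged)
```

With real norms each array would contain many more observations than this toy list, so the array mean and standard deviation would be far more stable.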



The question remained, which of these types of indices were most useful, 
in terms of their statistical and practical value, for use in operational 
detection? To answer this question a count was made of the number of times 
these copying cases were actually detected by each of the eight types of 
indices. These frequencies of detection are reported in Table 5.

The first and second columns of frequencies in Table 5 report the number 
of t-values equalling or exceeding 3.0 for each of the eight indices in the 
Verbal and in the Mathematical sections of the test. The third column merely 
gives the sums of the frequencies in the first two columns. Finally, since
not all of these 50 students necessarily copied on both sections of the test
(some appeared to have copied on the verbal section only, others on the
mathematical section only), the last column shows the number of cases in the
group of 50 that would have been detected on Verbal and/or Math by each of
the eight indices.

It appears from an examination of these frequencies that the most
successful indices were those involving counts of Rights and those involving
counts of Wrongs, especially Index A ($R_i R_j$ vs. $R_{ij}$), Index B
($W_i W_j$ vs. $Q_{ij}$), Index E ($W_i$ vs. $Q_{ij}$), Index G ($S_i$ vs.
$S_{ij}$), and Index H ($S_i$ vs. $K_{ij}$). The least successful were those
involving counts of Omits: Index D ($O_i O_j$ vs. $O_{ij}$) and Index F ($O_i$
vs. $O_{ij}$). In order to reduce the number of indices to a manageable size
for operational work, Indices D and F were therefore eliminated from
Table 5

Frequencies of Detection of Actual Copying Cases

                                            Frequencies
                                                    Verbal    Verbal
                                                    plus      and/or
Index                           Verbal    Math      Math      Math

A: $R_i R_j$ vs. $R_{ij}$         34       37        71        44
B: $W_i W_j$ vs. $Q_{ij}$         44       37        81        47
C: $W_{ij}$ vs. $Q_{ij}$          37       22        59        41
D: $O_i O_j$ vs. $O_{ij}$          8       14        22        18
E: $W_i$ vs. $Q_{ij}$             44       32        76        47
F: $O_i$ vs. $O_{ij}$              2        6         8         7
G: $S_i$ vs. $S_{ij}$             40       36        76        48
H: $S_i$ vs. $K_{ij}$             41       41        82        49
consideration. Index C ($W_{ij}$ vs. $Q_{ij}$), which appeared from the
frequencies in Table 5 to be somewhat less useful than those initially listed
(A, B, E, G, and H), was also eliminated. This left five indices for further
consideration. However, five indices were still too many, and there was little
question that further reduction was needed.

It seemed likely that quite apart from its statistical validity Index A
($R_i R_j$ vs. $R_{ij}$) might not be as easily defended and justified to the
satisfaction of the typical layman as the other indices. The examinee could
argue in his own behalf that a large number of right answers in common with
another examinee should be expected since (he could claim) both he and the other
examinee were able and knowledgeable students. Therefore, if the ultimate
judgment that cheating has occurred is to be made by nonstatisticians, the
fact that the $R_{ij}$-value in his case was significantly higher than the
$R_{ij}$ for examinees with the same $R_i R_j$ may not be convincing.

This line of reasoning was considered to be sufficiently persuasive to
cause the reduction of the number of potential (and presumably face-valid)
indices to four: B ($W_i W_j$ vs. $Q_{ij}$), E ($W_i$ vs. $Q_{ij}$), G ($S_i$
vs. $S_{ij}$), and H ($S_i$ vs. $K_{ij}$). In order to make further selections
among these indices, intercorrelations based on Sample 1 data were run among the
errors of estimate associated with each index. For example, if Index B is taken
as $x_1 - b_{12} x_2$, where $Q_{ij}$ is redefined for simplicity's sake as
Variable 1 and $W_i W_j$ is redefined as Variable 2, and if, similarly, Index G
is taken as $x_3 - b_{34} x_4$, the correlation between Index B and Index G can
be expressed as

$$r_{BG} = \frac{r_{13} - r_{12} r_{23} - r_{14} r_{34} + r_{12} r_{24} r_{34}}{\sqrt{1 - r_{12}^2}\,\sqrt{1 - r_{34}^2}} .$$


Table 6 gives the intercorrelations among Indices B, E, G, and H. Correlations
among the indices for SAT-verbal appear above the diagonal; correlations among
the indices for SAT-mathematical appear below the diagonal.
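
The correlation between two errors of estimate, such as the one written out for Indices B and G, can be checked numerically: the closed-form expression in terms of the six pairwise correlations should agree with the correlation computed directly from the two sets of residuals. The sketch below does this on simulated data. The variables, coefficients, and function names are made up for illustration and have nothing to do with the SAT data.

```python
import math
import random


def pearson(x, y):
    """Sample product-moment correlation."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)


def residuals(y, x):
    """Errors of estimate from the least-squares regression of y on x."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    slope = (sum((a - mx) * (b - my) for a, b in zip(x, y))
             / sum((a - mx) ** 2 for a in x))
    return [(b - my) - slope * (a - mx) for a, b in zip(x, y)]


def residual_correlation(r12, r13, r14, r23, r24, r34):
    """Correlation of (x1 - b12*x2) with (x3 - b34*x4) from pairwise r's."""
    num = r13 - r12 * r23 - r14 * r34 + r12 * r24 * r34
    return num / math.sqrt((1 - r12 ** 2) * (1 - r34 ** 2))


rng = random.Random(0)
x2 = [rng.gauss(0, 1) for _ in range(500)]
x4 = [0.5 * a + rng.gauss(0, 1) for a in x2]
x1 = [0.8 * a + 0.2 * b + rng.gauss(0, 1) for a, b in zip(x2, x4)]
x3 = [0.3 * a + 0.7 * b + 0.4 * c + rng.gauss(0, 1)
      for a, b, c in zip(x2, x4, x1)]

direct = pearson(residuals(x1, x2), residuals(x3, x4))
formula = residual_correlation(
    pearson(x1, x2), pearson(x1, x3), pearson(x1, x4),
    pearson(x2, x3), pearson(x2, x4), pearson(x3, x4))
print(direct, formula)
```

The practical appeal of the closed form is that only the six pairwise correlations are needed, so the residuals for every pair of answer sheets never have to be computed.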

From the correlations in Table 6 it appears that the overlap between
Index B and Index E is sufficiently great (r = .905 for verbal; r = .916 for
math) to warrant dropping one of them. Both of these indices, it is recalled,
depended on an examination of $Q_{ij}$, the number of items answered
incorrectly, and in the same way, by both examinees in the comparison. What
distinguishes Index B from Index E, it is recalled, is that the former uses
$W_i W_j$, the product of the numbers of wrong responses by i and j, as the
control variable and that the latter uses $W_i$, the number of wrong responses
made by examinee i or j, whichever is smaller. Index B appeared on a priori
grounds to be the more attractive index because it took into consideration
information based on both candidates, rather than just one. It also derives from
the logic, as suggested by Saupe (1960), that the expected value of $Q_{ij}$ is
the value $W_i W_j / K$, where K = no. of items in the test. (Saupe actually
developed this point in terms of the values $R_{ij}$ and $R_i R_j$.) On the
other hand it is worth considering that the expected variance of $Q_{ij}$ should
depend on the particular values of $W_i$ and $W_j$ separately, since the smaller
of the two values imposes an upper limit on $Q_{ij}$; when $W_i W_j$ is 400, for
example, the value of $Q_{ij}$ could be as high as 20 if $W_i$ and $W_j$ are
each 20, but only as high as 10 if $W_i$ were 10 and $W_j$ were 40. Ultimately,
the decision was made to use Index B ($W_i W_j$ vs. $Q_{ij}$), in preference to
Index E ($W_i$ vs. $Q_{ij}$), on the basis that it identified the known copiers
with more consistency than Index E (see Table 4). With Index E eliminated, the
remaining three indices, B, G, and H, were reduced to two, B and H, largely on
the basis of the lower correlations of B with H (.582 for SAT-verbal and .516
for SAT-math).
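
The point about the product of wrong responses hiding different ceilings can be made concrete with a small worked example. The sketch assumes a Saupe-style expectation of the form (wrongs of i) x (wrongs of j) / (number of items); the test length of 90 items is a hypothetical value chosen only for illustration, and the function names are not from the paper.

```python
def expected_same_wrong(w_i: int, w_j: int, n_items: int) -> float:
    """Assumed expectation: product of the wrong counts over the test length."""
    return w_i * w_j / n_items


def max_same_wrong(w_i: int, w_j: int) -> int:
    """Identical wrong answers cannot exceed the smaller wrong count."""
    return min(w_i, w_j)


N_ITEMS = 90  # hypothetical test length

for w_i, w_j in [(20, 20), (10, 40)]:
    print(w_i, w_j, w_i * w_j,
          expected_same_wrong(w_i, w_j, N_ITEMS),
          max_same_wrong(w_i, w_j))
```

Both pairs share the same product, and hence the same expectation under this assumption, while the ceiling on identical wrong answers differs by a factor of two.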
Table 6

Intercorrelations among Indices B, E, G, and H
(Based on Sample 1; N = 20,503)

B E G H
Figure 1 illustrates the sensitivity of Index H ($K_{ij}$, controlling on
$S_i$) in detecting the copying in the validation group. The distribution shown
at the left describes the degree of variation in Index H to be expected in a
group of examinees who are not copiers, with t extending from $-3\sigma$ to
$+3\sigma$. The dots shown near the baseline of the graph, most of them to the
right of the distribution, represent the frequencies of the t-values for the 50
validation cases on SAT-verbal. (The eight dots plotted at X = 17 represent
t-values of 17 or higher for eight examinees in the validation group. Space did
not permit plotting the higher t-values, which, as mentioned earlier in this
paper, ranged as high as 45.) The appearance of these dots far beyond normally
expected values of t makes it dramatically clear that most of these 50
examinees did indeed copy from a neighbor's answer sheet. Now it is also noted
that nine of the 50 dots are represented by t-values lower than 3.0, four of
them lower than .00. Although Index H fails to show that these nine copied
on SAT-verbal, it does show (but not in Figure 1) that eight of the nine copied
on SAT-mathematical. Thus, only one of the 50 cases was missed by Index H.

Application 

Current operational work in detecting copiers depends most heavily on
Index B ($Q_{ij}$, controlling on $W_i W_j$). When Index B fails to reveal that
a suspected examinee has copied from another paper, data for Index H ($K_{ij}$,
controlling on $S_i$) are also examined. Data for the other indices may also
be used, but only in instances of uncertainty. However, experience with
Indices B and H, even when used alone, has been quite satisfactory.

[Figure 1. ... as expected in a group of "honest" examinees and as found ...]

Although the security procedures at Educational Testing Service are under
constant review and refinement, they are subject to a philosophy that is
intentionally and explicitly permissive. No candidate who is suspected of
copying is investigated further in operational work unless one of the indices
in use departs from the mean of the appropriate array in the data for Sample 1
by 3.72 standard deviations or more, representing a confidence level of less
than 1 in 10,000 (assuming normal distributions in the arrays). Thus, only
if an examinee's paper shows such a strong similarity to another examinee's
paper that such an occurrence would be observed less than once in 10,000 in
comparisons made of the papers of honest examinees would the investigation of
the examinee's case be continued. (Lists of smoothed values, used to implement
these procedures, are shown for illustration as Tables 7 and 8, below.) In the
course of this investigation the examinee may be asked to take a retest to
confirm his questioned score. If he agrees, arrangements are made for retesting
under standard conditions and he is given the same form of the test on which
he received the questioned score. If, on this retest, he earns a score more
than 100 points lower than the questioned score, then the questioned score is
cancelled. Otherwise the questioned score is confirmed. All communications
and arrangements for retesting are made privately between the examinee and ETS.
Information regarding the events is withheld from the examinee's high school
and colleges of application, except on the initiative of the examinee himself.
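
The correspondence between a cutoff of 3.72 standard deviations and a rate of less than 1 in 10,000 can be checked against the upper tail of the standard normal distribution, under the normality assumption the text itself states. A minimal sketch using only the Python standard library (the function name `upper_tail` is introduced here for illustration):

```python
import math


def upper_tail(z: float) -> float:
    """P(Z >= z) for a standard normal variable Z."""
    return 0.5 * math.erfc(z / math.sqrt(2.0))


p_operational = upper_tail(3.72)  # cutoff used in operational work
p_validation = upper_tail(3.0)    # cutoff used in the validation study
print(f"{p_operational:.6f}", f"{p_validation:.6f}")
```

The validation threshold of 3.0 is considerably less stringent than the operational one, which is consistent with the permissive policy described above.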

Summary 

Comparison data on SAT-verbal and mathematical were collected on pairs of 
examinees in three samples for later use in detecting instances of willful 
copying. Two of the samples were constructed with the knowledge that no examinee 
could possibly have copied from the answer sheet of any other examinee in the 
sample. The third sample was taken entirely from a single center believed to 
be free of cheating. In each sample the answer sheet of each examinee was 

Table 7

Decision Points for Index B*

      SAT-Verbal                SAT-Mathematical
$W_i W_j$     $Q_{ij}$       $W_i W_j$     $Q_{ij}$

0-99             4           50-99            6
100-199          5           100-149          7
200-299          6           150-199          8
300-399          7           200-299          9
400-499          8           300-349         10
500-599          9           350-449         11
600-699         10           450-549         12
700-799         11           550-649         13
800-899         12           650-799         14
900-1099        13           800-899         15
1100-1299       14           900-1049        16
1300-1499       15           1050-1199       17
1500-1699       16           1200-1349       18
1700-1999       17           1350-1499       19
2000-2199       18           1500-1649       20
2200-2499       19           1650-1799       21
2500-2799       20           1800-1949       22
2800-3099       21           1950-2099       23
3100-3499       22
3500-4099       23

*Defined as occurring in "honest" comparisons
no more frequently than once in 10,000 times.
Table 8

Decision Points for Index H*

   SAT-Verbal            SAT-Mathematical
$S_i$     $K_{ij}$      $S_i$     $K_{ij}$

1-8          2          1-8          3
9-21         3          9-18         4
22-34        4          19-40        5
35-47        5
48-60        6

*Defined as occurring in "honest"
comparisons no more frequently than once
in 10,000 times.
compared with the answer sheet of every other examinee. Eight detection
indices were developed and distributions were run for possible operational
use in making future judgments regarding examinees who were actually suspected
of copying. Covariance analyses between samples indicated statistical but
not practical significance, and consequently it was judged that any one of
the samples could serve the purposes of operational detection as well as
either of the other two.

Empirical tryout of the indices against known and admitted copiers gave
some results which permitted the elimination of three of the indices from
further use. Practical considerations removed a fourth, and further
statistical study eliminated two others. The remaining two have been in
successful operational use at Educational Testing Service for more than two
years.